(54) Cache coherence network for a multiprocessor data processing system



(57) A cache coherence network for transferring
coherence messages between processor caches in a
multiprocessor data processing system is provided. The
network includes a plurality of processor caches
associated with a plurality of processors, and a binary logic
tree circuit which can separately adapt each branch of the
tree from a broadcast configuration during low levels of
coherence traffic to a ring configuration during high
levels of coherence traffic. A cache snoop-in input receives
coherence messages and a snoop-out output outputs,
at the most, one coherence message per current cycle
of the network timing. A forward signal on a forward
output indicates that the associated cache is outputting a
message on snoop-out during the current cycle. A cache
outputs received messages in a queue on the snoop-out
output, after determining any response message based
on the received message. The binary logic tree circuit
has a plurality of binary nodes connected in a binary tree
structure. Each branch node has a snoop-in, a
snoop-out, and a forward connected to each of a next higher
level node and two lower level nodes. A forward signal
on a forward output indicates that the associated node
is outputting a message on snoop-out to the higher node
during the current cycle. Each branch ends with multiple
connections to a cache at the cache's snoop-in input,
snoop-out output, and forward output.

Description

The present invention relates in general to cache
coherence networks for multiprocessor data processing
systems.

A cache coherence network connects a plurality of
caches to provide the transmission of coherence
messages between the caches, which allows the caches to
maintain memory coherence. A snoopy cache coherence
mechanism is widely used and well understood in
multiprocessor systems. Snoopy cache coherence in
multiprocessor systems uses a single bus as the
data transmission medium. The single bus allows
messages and data to be broadcast to all caches on the bus
at the same time. A cache monitors (snoops on) the bus
and automatically invalidates data it holds when the
address of a write operation seen on the bus matches
the address the cache holds.

A single bus cache coherence network becomes
impractical in medium-to-large multiprocessor systems.
As the number of processors in the system increases, a
significant load is placed on the bus to drive the larger
capacity, and the volume of traffic on the bus is
substantially increased. Consequently, the cycle time of the
snoopy bus scales linearly with the number of caches
attached to the bus. At some point, the cycle time of the
snoopy bus will become larger than the cycle time of the
processors themselves, resulting in a saturation of the bus.
Combining this with the fixed throughput of one
coherence message per cycle of the bus, the bus quickly
saturates as the number of caches attached to the bus
increases. Thus, there is a limit to the number of caches
that can be maintained effectively on a single snoopy
bus. What is needed is an interconnection network that
can adapt under the heavy electrical loading and
increased traffic conditions that may result in a large
multiprocessor system, thus providing scalability to the
system. It would be further desirable to provide an
interconnection network that acts logically like, and
affords a broadcast capability like, the snoopy bus.

It is the object of the present invention to provide an 
adaptive, scalable cache coherence network for a data 
processing system which acts like a snoopy bus and 
which provides broadcast capability. 

The foregoing objects are achieved as is now
described. According to the present invention as
claimed, a cache coherence network for transferring
coherence messages between processor caches in a
multiprocessor data processing system is provided. The
network includes a plurality of processor caches
associated with a plurality of processors, and a binary logic
tree circuit which can separately adapt each branch of the
tree from a broadcast configuration during low levels of
coherence traffic to a ring configuration during high
levels of coherence traffic.

In at least a preferred embodiment, each cache has
a snoop-in input, a snoop-out output, and a forward
output, wherein the snoop-in input receives coherence
messages and the snoop-out output outputs, at the most, one
coherence message per current cycle of the network
timing. A forward signal on a forward output indicates that
the associated cache is outputting a message on the
snoop-out during the current cycle. A cache generates
coherence messages according to a coherency protocol,
and, further, each cache stores messages received on
the snoop-in input in a message queue and outputs
messages loaded in the queue on the snoop-out output, after
determining any response message based on the
received message.

The binary logic tree circuit has a plurality of binary
nodes connected in a binary tree structure, starting at a
top root node and having multiple branches formed of
branch nodes positioned at multiple levels of a branch.
Each branch node has a snoop-in, a snoop-out, and a
forward output connected to each of a next higher level
node and two lower level nodes, such that a branch node
is connected to a higher node at a next higher level of
the tree structure, and to a first lower node and second
lower node at a next lower level of the tree structure. A
forward signal on a forward output indicates that the
associated node is outputting a message on snoop-out
to the higher node during the current cycle. Each branch
ends with multiple connections to a cache at the cache's
snoop-in input, snoop-out output, and forward output,
wherein the cache forms a bottom level node.

The invention will best be understood by reference 
to the following detailed description of an illustrative 
embodiment when read in conjunction with the accom- 
panying drawings, wherein: 

Figure 1 depicts a block diagram of a cache
coherence network;

Figure 2 shows a schematic diagram of a preferred
embodiment of a cache coherence network;

Figure 3 shows a schematic diagram of the logic
circuit of a preferred embodiment of a network node;

Figures 4 - 7 are the four possible port connection
configurations of the logic circuit of Figure 3, as it is
used in the embodiment of Figure 2;

Figure 8 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a first example;

Figure 9 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a second example;

Figure 10 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a third example;

Figure 11 shows a schematic diagram of a logic
circuit of a preferred embodiment of a network node.

With reference now to the figures and in particular
with reference to Figure 1, there is depicted a block
diagram of a cache coherence network. Network logic tree
10 is connected to a plurality of processor/caches
P0 - Pn-1. Each processor/cache Pj (Pn-1 ≥ Pj ≥ P0) represents
a processor with an associated cache, although the
physical implementation may not have the cache integral
to the processor as shown by the blocks in Figure 1. The
processor caches are also connected through a separate
data communications bus (not shown) for transferring
data blocks of memory between the processors and
the system's main memory.

As seen in Figure 1, each processor P0 - Pn-1 has
three connections to the network: snoop-out (SO),
forward (F), and snoop-in (SI). The F signal output from a
processor is a single bit signal. The SO and SI signals are
multi-bit signals carried over a multi-bit bus. The
information flowing over the network from the SO and SI ports
is referred to as coherence traffic and can be divided into
two categories: coherence requests and coherence
responses. The requests and responses are in the form
of packetized messages which travel in the network as
a single uninterrupted unit. Coherence requests are
initiated by a cache in response to a main memory access
by its processor. A coherence response typically is
initiated by other caches responding to requests which they
have received on their SI inputs. An example of a
coherence request would be a message asking a cache to
invalidate a block of data. For example, <tag id>
DCache-block-flush. An example of a coherence response would
be an acknowledge message indicating the data block
has been invalidated in the cache. For example, Ack, <tag
id>. The coherence messages used in the cache
coherence network of the present invention could take on
many forms, including those well known and often used
in current snoopy coherency schemes.

The SO output is used for outputting a number of
messages onto the network. The network is timed, so
that a cache may output only one message during each
cycle of the network timing. The cache may issue a new
coherence request, or it may respond to a coherence
request by generating a response, or it may simply pass
on a request that it had received earlier over its SI port.
When a cache uses its SO port to output a coherence
message, it requests participation in the coherence
traffic over the network by negating its F signal. When a
cache is not requesting participation in the coherence
traffic, it always asserts its F signal and outputs a
negated signal on the SO port (i.e., SO = 0).

A cache always receives coherence requests or
responses from other caches on its SI input. A cache
deletes a request it receives from the coherence traffic
on the SI port, if it is one it had sent out earlier over the
SO port to be issued to the other processors in the
network. Suitable identification fields are placed within each
coherence message when it is sent out from an SO port,
thus enabling a receiving cache to identify the originating
cache of the message. In this way a cache is able to
identify its own messages which it had sent out over the
network at a previous cycle, and to delete the message.
This message will be deleted regardless of whether the
F signal is asserted at the time of receipt.

A cache maintains a queue of incoming requests on
its SI port. This queue (not shown) is necessary because
over a given period of time the cache may be generating
its own coherence messages faster than it can evaluate
and/or rebroadcast the received messages. The cache
will delete a message from the SI queue if the message's
identification field shows it to be a message originating
from that cache.

In any cache coherence protocol which might be
used with the preferred embodiment, the cache
generates a response message if a received message is
relevant to its own contents and warrants a response. In
addition, the cache may either forward a received
request out onto the network over its SO port, or ignore it.

In accordance with the present invention, if the
cache had asserted the F signal when it received a
particular coherence request, the next processor in the
network must also have received that request (as explained
below). In that case, there is no need for the cache to
forward the message to the next cache in the network. If
the cache had negated the F signal at the time it received
the coherence request, and therefore had itself sourced
a valid coherence message to its SO port
simultaneously, the cache had clipped the broadcast mechanism
(as explained below) and must forward the received
coherence request to the next cache in the network.
What constitutes the "next" cache in the network may be
logically different than the physical makeup of the
computer system. The "next" cache or processor is
deciphered from the logic of the network logic tree 10, which
is made up of the network nodes. In the preferred
embodiment as shown in Figure 2, it will be shown that,
because of the logic circuitry, a "next" processor is the
processor to the left of a given processor, and is labelled
with a higher reference number (i.e. P1 > P0). But
because of the network connection at the root node of
the tree, P0 is the "next" processor after processor P7.

Along with saving the incoming message in the SI
queue, the receiving cache saves the current state of the
F signal at the time it receives the queued message.
Preferably the F signal is saved with the message in the
SI queue. To determine whether to forward a received
message out onto the network, the cache will check the
state of the F signal at the time that the coherence
message was received, which was stored in the message
queue at the same time as the message.
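
The receive-side rules described above (queue each message together with the F state sampled at receipt, delete messages the cache itself originated, and forward only messages whose broadcast was clipped) can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the class name `SnoopCache`, the `(origin, payload)` message tuple, and the method names are hypothetical.

```python
from collections import deque


class SnoopCache:
    """Hypothetical model of one cache's SI queue and forwarding rule."""

    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.si_queue = deque()  # entries: (message, F state at receipt)

    def receive(self, message, f_at_receipt):
        """Called once per cycle with the message seen on SI (or None)."""
        if message is None:
            return
        origin, _payload = message
        if origin == self.cache_id:
            return  # own message came back: delete it, whatever F was
        self.si_queue.append((message, f_at_receipt))

    def next_to_forward(self):
        """Pop queued messages until one that must be re-sent is found."""
        while self.si_queue:
            message, f_at_receipt = self.si_queue.popleft()
            # A real cache would evaluate the message here and build any
            # response required by its coherency protocol.
            if f_at_receipt == 0:
                return message  # broadcast was clipped here: pass it on
            # F was asserted: the next cache already saw this message.
        return None
```

For example, a cache with id 2 that receives its own message, then a message with F asserted, then a message with F negated, would forward only the last of the three.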

Referring now to Figure 2, there is depicted a
preferred embodiment of an adaptable, scalable binary tree
cache coherence network in a multiprocessor data
processing system, according to the present invention.
The network is comprised of eight processors and their
associated caches, P0 - P7, and the network nodes,
NODE1-7. Together they form a network by which the
processors P0 - P7 can efficiently pass coherence
messages to maintain coherent memory within their caches.
This network is able to adapt to varying volumes and
kinds of coherence messages being transmitted over the
network. The binary tree structure of the transmission
network has a cycle time which scales to the logarithm
of the number of caches (i.e., processors) connected to
the network. This enables the network of the present
invention to be scalable to medium-sized to large-sized
multiprocessor systems. When there is light traffic on the
network, processors are able to broadcast coherence
messages to other processors, providing a quick and
efficient cache coherence mechanism. As coherence traffic
increases, the network is able to adapt and pass
messages in a ring-like manner to the next processor in the
network. In that configuration, the network bandwidth is
increased by allowing pipelining of coherence traffic. In
fact, the throughput of coherence messages through the
network can be as high as the number of caches in the
network (one message per cache per cycle). Also, the ring
connections substantially reduce driving requirements.
Moreover, the network is also able to adapt to varying
degrees of increased traffic by segmenting itself into
broadcast sections and ring sections, depending on the
locality of increased traffic.

The network logic tree 10 (in Figure 1) is comprised
of a plurality of network nodes connected together in a
binary logic tree structure, and each of the processors
of the multiprocessor system is connected at the leaves
of the binary logic tree. In the preferred embodiment of
Figure 2, the network logic tree comprises root node
NODE1 at the top level of the tree and branch nodes
NODE2-7 formed along branches at lower levels of the
tree.

Each network node NODE1-7 is designed with an
identical logic circuit, that which is depicted in Figure 3,
according to a preferred embodiment of the present
invention. This circuit is the same circuit used in carry
look-ahead adder circuits. Therefore, the operation of
this circuit is well understood and well known by those
skilled in the art. The organization and operation of a
binary logic tree using the carry look-ahead circuit as the
universal link has been described in the prior art. See
G.J. Lipovski, "An Organization For Optical Linkages
Between Integrated Circuits", NCC 1977, which is
incorporated herein by reference. This paper describes the
use of a carry look-ahead circuit in a binary logic tree
to configure a broadcast or propagating link optical
communication network.

Network node 100 has three connections to a higher
level node in the tree: SO, F, and SI; and six connections
to two lower level nodes in the tree: SO0, F0, and SI0
connected to a first lower level node, and SO1, F1, and
SI1 connected to a second lower level node. Each SO
and SI port is labelled with a w to indicate that the port
accommodates w-bit-wide signals. Each of the F ports
accommodates a 1-bit-wide signal.

The SI port has an arrow pointing into the node 100
to show that the node receives messages from the higher
level node on that port. The SO and F ports have arrows
pointing away from the node showing that these are
output signals from the node to a higher level node in the
binary tree. Similarly, SI0 and SI1 have arrows
pointing away from node 100 showing that they are outputs
from node 100 and inputs (snoop-in) into their respective
lower level nodes. Ports F0, SO0, F1, and SO1 are shown
with arrows pointing into node 100 to indicate that they
are outputs from the lower level nodes and inputs into
node 100.

The circuit of Figure 3 is combinational, and has no
registers within it. The logic of the tree works as
stipulated when all signals are valid and stable. However, the
processors and caches which use the tree are
independently clocked circuits. In some system designs, it may
therefore be necessary to provide queues at the ports of
the tree and design an appropriate handshaking
mechanism for communication between a cache and its tree
ports. The tree is clocked independently and works on
the entries in front of the SO and F queues at its leaf
ends. (In fact, a separate F queue is not necessary, if an
empty SO queue implies an asserted F signal.) The tree
forwards the data to caches over the SI ports.
Additionally, if delays through the tree are not acceptable (for the
required cycle time of the tree), the tree can be pipelined
by adding registers at appropriate levels of the tree.
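
The parenthetical remark above, that F can be derived from an empty SO queue, can be illustrated with a small sketch. The function name `leaf_outputs` and the use of a Python `deque` as the SO queue are assumptions made for illustration; messages are assumed to be non-zero integers so that 0 can stand for the idle SO value.

```python
from collections import deque


def leaf_outputs(so_queue):
    """Return the (SO, F) pair a leaf port presents to the tree this cycle.

    An empty queue means the cache is not transmitting: F is asserted (1)
    and SO carries the idle value 0. Otherwise the entry at the front of
    the queue is driven onto SO and F is negated (0).
    """
    if not so_queue:
        return 0, 1
    return so_queue.popleft(), 0
```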

It should be noted that although the circuit of Figure
3 simply and efficiently provides the transmission
connections required for the present invention, it will be
appreciated by those skilled in the art that other circuit
configurations which provide the same input and output
connections to provide the same logical function could
also be used in the present invention. For example,
Figure 11 is a schematic diagram of a logic circuit which
may be used as a network node in an alternative
embodiment of the present invention. Also, the logic of the
forward signals or the snoop-in/snoop-out signals could be
inverted and the binary logic tree circuitry designed to
operate on these inverted signals, as will be appreciated
by those skilled in the art.

The operation of the circuit in Figure 3 is predicated
on the states of the forward signals F0 and F1. Therefore,
there are four possible configurations under which the
logic circuit operates. These four configurations are
shown in Figures 4-7.

Figure 4 diagrams the connections between ports
in node 100 when both forward signals from the lower
level nodes are not asserted (i.e. F0 = F1 = 0). Because
both nodes have negated their forward signals, the lower
level nodes will be outputting coherence messages over
their SO ports. SO0 will be transmitted to SI1 through
logical OR-gate OR1. The negated forward signals will turn
off AND-gates AND1, AND2 and AND3. This allows SO1
to pass through OR2 to SO. SI is directly connected to
SI0.

The second configuration of Figure 3 will produce a
connection of ports in node 100 as diagramed in Figure
5. In the second configuration, node0 (the node
connected to the right branch of node 100 and not shown)
is not transmitting (i.e., it is forwarding) a coherence
message to its next higher level node, in this case node 100.
Therefore, node0 has asserted its forward signal F0. The
other node connected to node 100, node1, is transmitting
a message to the next higher node, node 100, and thus
has negated its forward signal F1. With F1 = 0, AND3
outputs F = 0. The asserted F0 allows SI to transmit
through AND1 into OR1. Because, by definition with F0
asserted, SO0 is not outputting any messages, only the
output of AND1 is output at port SI1. Again, with F1 = 0,
AND2 is closed and SO1 passes through OR2 to SO.

Referring now to Figure 6, there is diagramed a third
configuration of the logic circuit of Figure 3. In this
situation node0 is transmitting a message over the network
and node1 is not: F0 = 0 and F1 = 1. F0 closes AND3 to
produce F = 0. Once again SI is directly connected to
SI0. Because F0 is negated, node0 is transmitting messages
over SO0, which is directly connected to SI1 through
OR1. The negated F0 closes AND1 as an input into OR1.
The asserted F1 allows SO0 to pass through AND2 into
OR2. By definition, an asserted F1 indicates that no
messages are output on SO1, and therefore, the output of
AND2 passes through OR2 to SO.

The fourth possible configuration of the logic circuit
of Figure 3 occurs when neither of the lower level nodes
is transmitting messages to node 100. A diagram of the
transmission connections for this configuration is shown
in Figure 7. Here, F0 = F1 = 1. These inputs generate F
= 1 from AND3. SI is directly connected to SI0. F0 is
asserted, allowing SI to pass through AND1 and OR1 to
SI1. Node0 is not transmitting, so SO0 does not pass
through OR1 to SI1. Although SO1 is connected through
OR2 to SO, and SO0 is connected through AND2 and
OR2 to SO, those connections are not shown to simplify
the diagram of Figure 7, since neither node is
transmitting any messages over its snoop-out port.
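
The four configurations above follow from one set of gate equations. The sketch below models a node as a pure function, using the gate names AND1-AND3 and OR1-OR2 from the description; the function name `node`, the integer encoding of messages (0 meaning an idle port), and the test values are assumptions made for illustration.

```python
def node(so0, f0, so1, f1, si):
    """Combinational logic of one network node.

    so0, f0: snoop-out and forward from the first (right) lower node
    so1, f1: snoop-out and forward from the second (left) lower node
    si:      snoop-in from the higher node
    Returns (so, f, si0, si1). Messages are integers, 0 = idle.
    """
    and1 = si if f0 else 0   # AND1 gates SI with F0
    and2 = so0 if f1 else 0  # AND2 gates SO0 with F1
    f = f0 & f1              # AND3
    si1 = so0 | and1         # OR1
    so = so1 | and2          # OR2
    si0 = si                 # direct connection
    return so, f, si0, si1


A, B, UP = 0b001, 0b010, 0b100  # arbitrary distinct test messages

print(node(A, 0, B, 0, UP))  # Figure 4 -> (2, 0, 4, 1)
print(node(0, 1, B, 0, UP))  # Figure 5 -> (2, 0, 4, 4)
print(node(A, 0, 0, 1, UP))  # Figure 6 -> (1, 0, 4, 1)
print(node(0, 1, 0, 1, UP))  # Figure 7 -> (0, 1, 4, 4)
```

In each result the first element is what the node sends upward on SO, and the last two are what it delivers to its right and left lower nodes.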

Referring again back to Figure 2, root node NODE1
is the top level node of the binary logic tree. The SO of
NODE1 is directly connected to the SI of NODE1. The
two branches of the binary logic tree extending down
from the root node to the next level nodes NODE2,
NODE3 are comprised of three busses for delivering
signals. As can be seen from Figure 2, the connections of
NODE1 to NODE2 are equivalent to the connections
from node 100 to node0, as described with Figure 3, and
the connections of NODE1 to NODE3 are equivalent to
the connections of node 100 to node1, as described with
Figure 3.

From each of nodes NODE2 and NODE3, the binary tree
again branches into two connections to the lower level
nodes. Each of the higher level connections from
NODE4-NODE7 is connected to its associated next higher
level node's lower level connections. The branch nodes
NODE4-NODE7 in turn have two branch connections to the
next lower level nodes, in this case, those nodes being the
processors/caches P0 - P7. Each processor P0 - P7 has its
SO, F, and SI connected to the lower level connections
of the next higher level node (i.e. NODE4-NODE7).

For three examples of how the cache coherence
network of the present invention adapts to coherence traffic
on the network, consider Figures 8-10. For the first
example, consider the extreme case where every cache
on the network is attempting to transmit a coherence
message onto the network. In this extreme case, every
cache must receive every other cache's message and
potentially might respond with another message for each
received message. Such a scenario forces the cache
coherence network into a ring-type network where each
cache passes a received message on to the next cache
in the network after determining any response of its own
to the message.

In the example of Figure 8, it can be seen that all
caches are negating their forwarding signals (F = 0), so
that they may transmit a coherence message out onto
the network. Consequently, NODE4 - 7 will have negated
forward inputs from the lower level nodes. Thus, the logic
circuit of each node will create transmission connections
equal to those shown in Figure 4, as shown in Figure 8.
As can be seen from Figure 4, NODE4 - 7 will also
negate their forward signals, resulting in NODE2 and
NODE3 being configured as Figure 4. Last, NODE1 also
has two negated forward signal inputs, configuring
NODE1 as Figure 4.

The dashed arrows shown in Figure 8 indicate data
flow within the network. As can be seen, with every cache
in the network outputting a message on the network
during this current cycle, each cache only transmits to the
next cache in the network. For example, P0 outputs its
coherence message on its snoop-out (SO). This arrives
at NODE4 on its SO0 port, which is connected to its SI1
port, which delivers P0's coherence message to P1 at
its SI port. P1 outputs its message on its SO port, which
arrives at SO1 of NODE4. This is transmitted to the SO
port of NODE4, and on to the node at the next higher
level of the tree. In this case, the next higher node from
NODE4 is NODE2. Here, the message arrives on the
right branch leading to NODE2. NODE2 is configured to
transfer this message back down the left branch to
NODE5. In turn, NODE5 connects its SI port to the SI
port of the right branch node at the next lower level from
NODE5, in this case, that node being P2. It can be seen
then, that the coherence message output from P1 is
transmitted through NODE4, up to NODE2, back down
to NODE5, and then arrives at P2.

By inspecting the transmission paths of the
remainder of the processors, it can be seen that each processor
passes its coherence message on to only the next
processor in the network. Because that next processor is also
transmitting a message onto the network, the message
from the previous processor is necessarily clipped and
is not sent on to any other processors in the system. This
can be understood, with reference to Figure 8, by
noticing that data moves in one direction within the network.
Because of the particular logic circuit used in the
preferred embodiment, data generally travels from the
right-hand side of the network to the left-hand side before
passing over the top of the tree to transmit to the
remainder of the tree on the right-hand side. Thus, in the
preferred embodiment, the next processor in the network is
the next processor to the left.

In Figure 8, the network has formed a ring-network.
In this network, each processor passes a network
message on to the next processor. The caches continue to
pass the message along to the next cache in the ring
with each cycle of the network until every cache in the
network has received the message.
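
The ring behaviour of the first example can be reproduced with a short simulation of one network cycle. This is a sketch under stated assumptions: the node equations are those read from the description of Figures 4-7 (SO = SO1 or (F1 and SO0), F = F0 and F1, SI0 = SI, SI1 = SO0 or (F0 and SI)), the root wires its SO back to its SI, and P0 sits on the rightmost leaf. The function name `run_cycle` and the one-hot message encoding are hypothetical.

```python
def run_cycle(so, f):
    """Return the SI value seen by each cache for one network cycle.

    so[i], f[i] are the snoop-out and forward outputs of cache Pi.
    P0 is the rightmost leaf; the lower-indexed half of each subtree
    hangs off the node's first (right) lower port.
    """
    n = len(so)
    si = [0] * n

    def up(lo, hi):
        # Bottom-up pass: (SO, F) presented by the subtree over P[lo:hi].
        if hi - lo == 1:
            return so[lo], f[lo]
        mid = (lo + hi) // 2
        so0, f0 = up(lo, mid)
        so1, f1 = up(mid, hi)
        return so1 | (so0 if f1 else 0), f0 & f1

    def down(lo, hi, si_in):
        # Top-down pass: distribute SI through the subtree over P[lo:hi].
        if hi - lo == 1:
            si[lo] = si_in
            return
        mid = (lo + hi) // 2
        so0, f0 = up(lo, mid)
        down(lo, mid, si_in)                        # SI0 = SI
        down(mid, hi, so0 | (si_in if f0 else 0))   # SI1 = SO0 or (F0 and SI)

    root_so, _ = up(0, n)
    down(0, n, root_so)  # root node: SO wired back to SI
    return si


# Every cache transmits; the message of Pi is the one-hot value 1 << i.
received = run_cycle([1 << i for i in range(8)], [0] * 8)
print(received)  # [128, 1, 2, 4, 8, 16, 32, 64]
```

Each cache Pi receives only the message of its right-hand neighbour, and P0 receives the message of P7 over the root, which is the ring described for Figure 8.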

Referring now to Figure 9, there is depicted a
diagram of the data flow for a second example within a
preferred embodiment of the cache coherence network of
the present invention. In this extreme example, only one
processor in the network is attempting to transmit a
coherence message over the network during the current
cycle. Because no other messages are being sent over
the network during the current cycle, the one processor
transmitting over the network is able to broadcast its
message to every other processor within this one cycle.
Here, P1 is transmitting a message, and, therefore, has
negated its forward signal. All other caches, having not
transmitted a message, have asserted their forward
signals (P2 - P7 have F = 1). Therefore, NODEs 5 - 7 are
configured as shown in Figure 7. Each of these nodes
asserts its forward signal. This results in NODE3 being
configured as shown in Figure 7. NODE4 receives a
negated forward signal from its left branch and an
asserted forward signal from its right branch, coming
from the processor nodes P1 and P0, respectively. This
places NODE4 in the configuration of Figure 5. NODE2
receives an asserted forward signal from NODE5 and a
negated forward signal from NODE4, configuring it as
shown in Figure 6. Similarly, NODE1 receives an
asserted forward signal from NODE3 on its left branch,
and a negated forward signal from NODE2 on its right
branch, resulting in a configuration as shown in Figure 6.
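
The configuration of every node follows mechanically from the forward signals at the leaves, which the sketch below makes explicit. It assumes, consistent with the connections described for Figure 2, that node k has node 2k on its first (right) port and node 2k+1 on its second (left) port, with the caches P0 - P7 taking the place of nodes 8 - 15. The function name `node_configurations` is hypothetical.

```python
def node_configurations(f_leaves):
    """Map each node number to the figure (4-7) showing its configuration.

    Nodes are numbered in heap order (an assumption): node k has node 2k
    on its first (right) port and node 2k+1 on its second (left) port;
    indices n..2n-1 stand for the caches P0..Pn-1.
    """
    n = len(f_leaves)
    f = [None] * n + list(f_leaves)  # f[k] = forward output of node k
    figure = {(0, 0): 4, (1, 0): 5, (0, 1): 6, (1, 1): 7}  # key: (F0, F1)
    config = {}
    for k in range(n - 1, 0, -1):    # lowest-level nodes first, root last
        f0, f1 = f[2 * k], f[2 * k + 1]
        config[k] = figure[(f0, f1)]
        f[k] = f0 & f1               # AND3: forward only if both forward
    return config


# Second example: only P1 transmits (F = 0); all other caches assert F.
config = node_configurations([1, 0, 1, 1, 1, 1, 1, 1])
for k in sorted(config):
    print(f"NODE{k}: Figure {config[k]}")
```

For the second example this yields Figure 6 for NODE1 and NODE2, Figure 5 for NODE4, and Figure 7 for NODE3 and NODE5 - 7, in agreement with the paragraph above.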

Given this structure of the network connections
during the current cycle, the dashed arrows in Figure 9
describe the direction of coherence message
transmission from processor P1 to the rest of the processors
connected to the cache coherence network. The message
output from P1's SO port passes through NODE4 up into
NODE2, where the message is transferred both back
down the left branch from NODE2 to NODE5, and up
through NODE2 and up along the right branch of
NODE1. The message wrapping around from P1
through NODE5 is then transferred back down both the
left and right branches of NODE5 to processors P2 and
P3. The message also is transmitted through SO of
NODE2 along the right branch of NODE1. This message
is transferred back down the left branch of NODE1 to be
broadcast back down the entire left-hand side of the
binary logic tree so that P4 - P7 receive the message.
The message is also transmitted up through the SO port
of NODE1, which wraps back down through the
right-hand branch of NODE1 into NODE2, and again down
the right-hand branch of NODE2 into NODE4, where the
message is passed down both branches of NODE4 into
P0 and P1.

As can be seen from the above description of Figure
9, the cache coherence network of the present invention
was able to adapt itself to a broadcast network so that a
single processor was able to broadcast the message to
the entire network within one cycle of the cache
coherence system. The message spreads out along the
branches of the tree to all processors to the left of the
broadcasting processor that are within the broadcaster's
half of the binary logic tree. When the broadcasted
message reaches the root node, NODE1, the message is
passed back down along the right-hand side of the
broadcasting processor's half of the tree so that all
processors to the right of the broadcasting processor in its
half of the tree receive the message. At the same time,
the message is broadcast down from the root node to all
processors in the entire other half of the binary logic tree.
In the broadcast mode, the broadcasting processor will
also receive its own message. As has been explained, the
received message will contain an identification field
which indicates to the broadcasting cache that the
received message was its own, and thus should be
ignored.

Referring now to Figure 10, there is depicted a third
example of the connections and data transmission in a
preferred embodiment of the cache coherence network
of the present invention during a particular cycle of the
network. This example shows how the present invention
can adapt to provide a combination of the ring and
broadcast networks under conditions between the two
extremes described in the examples of Figure 8 and
Figure 9.

In this example, for the current cycle, processors P1,
P2, P4, and P5 are transmitting coherence messages
onto the network, as is indicated by their negated forward
signals. Processors P0, P3, P6, and P7 are not
transmitting onto the network during the current cycle, as is
indicated by their asserted forward signals.

A 0-1 forward signal input into NODE4 configures it as Figure 5.
A 1-0 forward signal input into NODE5 configures it as Figure 6.
A 0-0 forward signal input into NODE6 configures it as Figure 4.
A 1-1 forward signal input into NODE7 configures it as Figure 7.
The forward signals from both NODE4 and NODE5 are negated,
configuring NODE2 as seen in Figure 4. The forward signal of
NODE6 is negated and the forward signal of NODE7 is asserted,
configuring NODE3 as shown in Figure 6. The forward signals of
NODE2 and NODE3 are both negated, configuring NODE1 as seen in
Figure 4.
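
The configuration step just described can be sketched in a few
lines of Python. This is only an illustration, not the circuit
of the embodiment. The function name `configure_tree`, the list
layout (index i holds the forward signal of Pi, with 1 meaning
asserted), the reading of each pair as left-right with the
higher-numbered input on the left, and the rule that a node
asserts its own forward signal only when both of its inputs are
asserted are all assumptions inferred from the example values
above.

```python
# Figure number selected by the (left, right) forward-signal pair,
# as listed in the text: 0-0 -> Fig. 4, 0-1 -> Fig. 5,
# 1-0 -> Fig. 6, 1-1 -> Fig. 7.
FIGURE_FOR_PAIR = {(0, 0): 4, (0, 1): 5, (1, 0): 6, (1, 1): 7}


def configure_tree(leaf_forward):
    """Return the figure number chosen for every node, level by level.

    leaf_forward[i] is the forward signal of processor Pi
    (1 = asserted = not transmitting, 0 = negated = transmitting).
    The first returned list covers the nodes directly above the
    caches; the last one holds the root node only.
    """
    level = list(leaf_forward)
    figures_per_level = []
    while len(level) > 1:
        figures, parents = [], []
        for i in range(0, len(level), 2):
            # Assumption: the lower-numbered input is the right branch.
            right, left = level[i], level[i + 1]
            figures.append(FIGURE_FOR_PAIR[(left, right)])
            # Assumption: a node's forward is asserted only if both
            # of its lower-level forward signals are asserted.
            parents.append(left & right)
        figures_per_level.append(figures)
        level = parents
    return figures_per_level


# Figure 10 example: P1, P2, P4 and P5 transmit (forward negated).
print(configure_tree([1, 0, 0, 1, 0, 0, 1, 1]))
# [[5, 6, 4, 7], [4, 6], [4]]
```

Read left to right, the three lists correspond to NODE4 through
NODE7, then NODE2 and NODE3, then NODE1, which agrees with the
figure assignments given in the text.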

Processor P1's message will pass through NODE4 up the right
branch of NODE2, down the left branch into NODE5, and down the
right branch of NODE5 into processor P2. Because processor P2
was also broadcasting a message during this cycle, NODE5 could
not be configured to allow both P1's and P2's messages to be
transferred down the left branch of NODE5 into P3. Thus, P1's
message is clipped at P2, and P2 must maintain P1's coherence
message in its SI queue to be retransmitted to the rest of the
network on the next or a succeeding cycle. Also, P1's message
was not transmitted up through SO of NODE2 because another
processor to the left of P1 at a leaf of the NODE2 branch was
also transmitting and, therefore, was given the connection to
the snoop-out of NODE2.

As can be seen from Figure 10, P2's message passed back down the
left branch of NODE5 to P3 and up through SO of NODE5, through
to the SO of NODE2 to NODE1. At NODE1, the message was
transmitted back down the left branch of NODE1 to NODE3. The
message from P2 passes down the right branch of NODE3 and the
right branch of NODE6 into P4. There the message is clipped
because of P4's transmission. P4's transmission is also clipped
by P5's transmission, and therefore is passed only to NODE6 and
then back down to P5.

P5's message passes up through NODE6 and then up along the right
branch of NODE3, where it is sent both back down the left branch
of NODE3, and up to the next higher level node, NODE1. The
message passing down the left branch of NODE3 is broadcast
through NODE7 into processors P6 and P7. The message sent
through the snoop-out of NODE3 routes back up through NODE1, and
down along the right branch of NODE1 and NODE2 into NODE4. At
NODE4, the message of P5 is broadcast to both P0 and P1.
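
The routing traced above can be checked with a small one-cycle
simulation. The sketch below is a simplified model under stated
assumptions rather than the circuit itself: it takes the
right-hand lower node as the "first" lower node and the
left-hand one as the "second", applies the node connection rules
recited in the claims, and ties the root's snoop-out back to its
snoop-in. The names `snoop_out`, `deliver` and `route_cycle` and
the nested-tuple tree layout are hypothetical.

```python
# Tree as nested (first, second) = (right, left) pairs; leaves are
# processor numbers. The root pair is (NODE2, NODE3).
TREE = (((0, 1), (2, 3)), ((4, 5), (6, 7)))


def snoop_out(node, transmitting):
    """Processor whose message leaves `node` upward on SO, or None."""
    if isinstance(node, int):
        return node if node in transmitting else None
    first, second = node
    out_first = snoop_out(first, transmitting)
    out_second = snoop_out(second, transmitting)
    # A transmitting second lower node owns the upward snoop-out;
    # otherwise the first lower node's output (if any) goes up.
    return out_second if out_second is not None else out_first


def deliver(node, snoop_in, transmitting, received):
    """Push the message on `snoop_in` down through `node`."""
    if isinstance(node, int):
        received[node] = snoop_in
        return
    first, second = node
    out_first = snoop_out(first, transmitting)
    # The first lower node always sees the message from above.
    deliver(first, snoop_in, transmitting, received)
    # The second lower node sees the first node's output when the
    # first node transmits, else the message from above.
    to_second = out_first if out_first is not None else snoop_in
    deliver(second, to_second, transmitting, received)


def route_cycle(tree, transmitting):
    """Map each processor to the sender it hears from this cycle."""
    received = {}
    # Root node: snoop-out is connected back to snoop-in.
    deliver(tree, snoop_out(tree, transmitting), transmitting, received)
    return received


print(route_cycle(TREE, {1, 2, 4, 5}))
# {0: 5, 1: 5, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5, 7: 5}
```

Under these assumptions the result matches the walk-through: P2
receives P1's message, P3 and P4 receive P2's, P5 receives P4's,
and P5's message reaches P6, P7, P0 and P1. With a single
transmitter the same model reduces to a full broadcast, and with
every processor transmitting it reduces to a ring.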

When using the network of the present embodiment, there is a
danger that a cache initiating a coherence message might assert
its forward signal during a cycle in which one of its own
messages currently pending on the network is delivered back to
it. The problem is that the message is then passed on to a
subsequent processor once again, which will reforward it
throughout the network, and this could continue forever since
the initiating cache may continue to assert its forward signal.

To correct for this danger of a continuously forwarded request,
additional logic can be added to the network. At the root level
node in the network, the path that forwards data from the left
half of the tree to the right half of the tree would have a
decrementer, and so would the path that goes in the opposite
direction. Each coherence request sent out over the network
would contain an additional 2-bit value that is decremented
every time the message traverses between the two half-trees. An
additional bit in the request carries a flag stating that the
request is valid or invalid. This flag bit is set to "true" by
the initiating cache, while the count value is set to 2. The
flag is turned to invalid when the count is already 0 when the
request reaches the root node. All invalid requests received are
to be discarded at once by a receiving cache. If a unitary
coding of 2 is used, an easy implementation of the decrementers
is a mere inverter. The right-to-left transfer negates one bit
of the two-bit count value, and the left-to-right transfer
negates the other. The logical OR-ing of the two count bits as
they come into the root node generates the valid bit.
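
A minimal sketch of this inverter-based decrementer follows. It
is an illustration under assumptions, not the logic of the
embodiment: the `Request` type, the `cross_root` function, the
direction strings, and the choice of which count bit belongs to
which direction are all hypothetical. The valid bit is computed
from the count bits as they arrive at the root, before the
inversion is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Two count bits, initially both 1 (the count value 2).
    # Assumption: bit 0 is negated by the right-to-left path and
    # bit 1 by the left-to-right path.
    count_bits: tuple = (1, 1)
    valid: bool = True


def cross_root(req, direction):
    """Model one traversal between the two half-trees at the root."""
    bit_rl, bit_lr = req.count_bits
    # Valid bit: logical OR of the count bits as they come in.
    valid = bool(bit_rl | bit_lr)
    # The "decrementer" is a mere inverter on one of the two bits.
    if direction == "right_to_left":
        bit_rl ^= 1
    elif direction == "left_to_right":
        bit_lr ^= 1
    else:
        raise ValueError(f"unknown direction: {direction}")
    return Request((bit_rl, bit_lr), valid)


first = cross_root(Request(), "right_to_left")   # count 2 -> 1, valid
second = cross_root(first, "left_to_right")      # count 1 -> 0, valid
third = cross_root(second, "right_to_left")      # arrives with 0: invalid
print(first.valid, second.valid, third.valid)
# True True False
```

In this model a request survives two traversals of the root and
is flagged invalid on the third, at which point a receiving
cache would discard it.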

Another problem is that of detecting whether all caches have
responded or deciding that no more caches would respond at a
later time to a particular coherence message. This problem
arises because of the adaptability of the cache coherence
network. As coherence traffic increases, the number of messages
clipped increases, which necessarily delays the transmission of
requests and responses by additional cycles. This problem is
best solved by means of a protocol and time-out mechanism that
assumes an upper bound on the delay that each cache may
introduce in the path of a message and of the corresponding
response, assuming that each is clipped at every cycle. Adding
up these delays produces an upper bound on the time after which
no further responses may be expected from any cache in the
network.
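
The time-out bound described above amounts to a sum of per-cache
worst-case delays. A minimal sketch, assuming the delays are
expressed in network cycles and supplied by the system designer,
is given below. The function name `response_timeout_cycles` and
the example figures are hypothetical and are not taken from the
embodiment.

```python
def response_timeout_cycles(request_delays, response_delays):
    """Upper bound, in network cycles, on when the last response arrives.

    Each sequence holds one assumed worst-case delay per cache on the
    path, i.e. the delay incurred if the message is clipped there.
    """
    delays = list(request_delays) + list(response_delays)
    if any(d < 0 for d in delays):
        raise ValueError("delay bounds must be non-negative")
    # Adding up the per-cache bounds gives the overall time-out.
    return sum(delays)


# Hypothetical figures: 8 caches, each delaying a request or a
# response by at most 3 cycles when it clips the message.
print(response_timeout_cycles([3] * 8, [3] * 8))
# 48
```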

Although the present invention has been described in a scheme
based on a binary tree, the present invention can easily be
generalized to any M-ary tree. It has been shown in the
literature that a modified binary tree can be imbedded in a
hypercube. See "Scalability Of A Binary Tree On A Hypercube",
S.R. Deshpande and P.M. Jenevein, ICPP 1986, incorporated herein
by reference. This technique can be applied to achieving snoopy
protocol in a hypercube based multiprocessor system.

In summary, the cache coherence network of the 
present invention automatically adapts to the coherence 
traffic on the network to provide the most efficient trans- 
mission of coherence messages. The network adapts to 
a broadcast network or a ring network, or any combina- 
tion in between, as a function of which caches attached 
to the network are attempting to transmit coherence traf- 
fic on the network. Thus, branches of the binary logic 
tree with light coherence traffic may be predominately 
configured in a broadcast configuration to allow coher- 
ence messages to be quickly delivered to each cache 
within that branch. Still other branches with heavy coher- 
ence traffic will automatically adapt to this increased traf- 
fic and configure themselves predominately in a ring 
network. 

While the invention has been particularly shown and 
described with reference to a preferred embodiment, it 
will be understood by those skilled in the art that various 
changes in form and detail may be made therein without 
departing from the scope of the invention. 

Claims 

1. A cache coherence network for transferring coherence
messages between processor caches in a multiprocessor data
processing system, the network comprising:

a plurality of processor caches associated with a plurality of
processors, each cache having a snoop-in input, a snoop-out
output, and a forward output, wherein the snoop-in input is
arranged to receive coherence messages and the snoop-out is
arranged to output, at the most, one coherence message per
current cycle of the network timing, and arranged so that a
forward signal on the forward output indicates that the cache is
outputting a message on the snoop-out output during the current
cycle, wherein a cache is arranged to generate coherence
messages according to a coherency protocol, and,
further, wherein each cache is arranged to store messages
received on the snoop-in input in a message queue and to output
messages loaded in the queue on the snoop-out output, after
determining any response message based on the received message;
and

a binary logic tree circuit having a plurality of binary nodes
connected in a binary tree structure, starting at a top root
node and having multiple branches formed of branch nodes
positioned at multiple levels of a branch, and each branch node
having a snoop-in, a snoop-out, and a forward connected to each
of a next higher level node and two lower level nodes, such that
a branch node is connected to a higher node at a next higher
level of the tree structure, and to a first lower node and
second lower node at a next lower level of the tree structure,
and arranged so that a forward signal on a forward indicates
that the associated node is outputting a message on snoop-out to
the higher node during the current cycle, and wherein each
branch ends with multiple connections to a cache at the cache's
snoop-in input, snoop-out output, and forward output, wherein
the cache forms a bottom level node.



2. A cache coherence network as claimed in Claim 1, wherein a
node is arranged to transmit a message received on the snoop-in
from the higher node to the snoop-in of the first lower level
node, to transmit a message received on the snoop-out of the
first lower level node to the snoop-in of the second lower level
node, and to transmit a message received on the snoop-out of the
second lower level node to the snoop-out going to the higher
level node, when the first and second lower nodes are
transmitting coherency messages during the current cycle.

3. A cache coherence network as claimed in Claim 1 or Claim 2,
wherein a node is arranged to transmit a message received on the
snoop-in from the higher node to the snoop-in of the first lower
level node, and to transmit a message received on the snoop-out
of the first lower level node to both the snoop-in of the second
lower level node and the snoop-out going to the higher level
node, when the first lower node is arranged to transmit a
coherence message and the second lower node is not transmitting
a coherency message during the current cycle.



4. A cache coherence network as claimed in any preceding claim,
wherein a node is arranged to transmit a message received on the
snoop-in from the higher node to the snoop-in of the first lower
level node and the snoop-in of the second lower level node, when
the first and second lower nodes are not transmitting coherency
messages during the current cycle.

5. A cache coherence network as claimed in any preceding claim,
wherein a node is arranged to transmit a message received on the
snoop-in from the higher node to both the snoop-in of the first
lower level node and the snoop-in of the second lower level
node, and to transmit a message received on the snoop-out of the
second lower level node to the snoop-out going to the higher
level node, when the first lower node is not transmitting a
coherence message and the second lower node is transmitting a
coherency message during the current cycle.

6. A cache coherence network as claimed in any preceding claim,
wherein the root node has the snoop-out to the higher node
connected to the snoop-in from the higher node.

7. A cache coherence network as claimed in any preceding claim,
wherein a cache is arranged to assert a forward signal on the
forward output when the cache is not transmitting a coherence
message on the snoop-out output, and to negate the forward
signal on the forward output when the cache is transmitting a
coherence message during the current cycle.

8. A cache coherence network as claimed in any preceding claim,
wherein all nodes of the binary logic tree circuit are carry
look-ahead circuits.